منابع مشابه
The Web as a Parallel Corpus
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structur...
متن کاملWeb as a Corpus
The World Wide Web offers a unique possibility to create very large and high quality text collections with low manual work necessary. In this paper, we describe requirements for usable linguistic corpus and we present a routine for building such corpus from the web pages. Several important issues that must be resolved for successful processing are discussed. As a pilot study, we create a billio...
متن کاملGoogleLing: The Web as a Linguistic Corpus
We describe software to transform any search engine or searchable corpus into a tool for linguistic research with a rich query syntax. We provide support for case sensitive searches, within-sentence and within-N-words match constraints, part-ofspeech restrictions on words, and “smart” verb-ending inflection wildcards. The software generalizes the query for the underlying search engine, and then...
متن کاملWeb as Corpus
The corpus resource for the 1990s was the BNC. Conceived in the 80s, completed in the mid 90s, it was hugely innovative and opened up myriad new research avenues for comparing different text types, sociolinguistics, empirical NLP, language teaching and lexicography. But now the web is with us, giving access to colossal quantities of text, of any number of varieties, at the click of a button, fo...
متن کاملAnnotated Web As Corpus
This paper presents a proposal to facilitate the use of the annotated web as corpus by alleviating the annotation bottleneck for corpus data drawn from the web. We describe a framework for large-scale distributed corpus annotation using peerto-peer (P2P) technology to meet this need. We also propose to annotate a large reference corpus in order to evaluate this framework. This will allow us to ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Computational Linguistics
سال: 2003
ISSN: 0891-2017,1530-9312
DOI: 10.1162/089120103322711578